Local LLM

This guide walks you through setting up your own Large Language Model (LLM) server using Ollama on an Ubuntu VM with NVIDIA GPU passthrough in Proxmox.

What is This Setup?

This configuration allows you to:

Run LLMs locally on your own hardware with GPU acceleration
Host models like Llama, Mistral, or GPT-OSS for private AI inference
Achieve faster response times compared to CPU-only inference
Maintain data privacy by keeping everything on your infrastructure
Access your LLM through a web UI similar to ChatGPT

Unlike cloud-based LLM services, this setup gives you complete control over your models, data, and costs.

What is GPU Passthrough?

GPU passthrough (also called PCIe passthrough) allows a virtual machine to directly access a physical GPU, bypassing the hypervisor layer. This means:

Near-native performance: Your VM gets almost the same GPU performance as bare metal
Direct hardware access: The VM controls the GPU as if it were physically installed
Exclusive access: Only one VM can use the passed-through GPU at a time
Required for GPU compute: Essential for running LLMs with GPU acceleration in VMs

Without GPU passthrough, your LLM would run on the CPU only, which is typically far slower than GPU-accelerated inference (roughly 10-100x, depending on the model and hardware).

Prerequisites

Before starting, you need:

A Proxmox server with an NVIDIA GPU installed
An Ubuntu Server VM (22.04 or later recommended)
Docker installed on the VM
Basic familiarity with Linux command line
SSH access to your VM
Sufficient VRAM on your GPU

GPU recommendations:

Small models (7B parameters): 8GB VRAM minimum
Medium models (13B-20B): 16GB+ VRAM
Large models (30B+): 24GB+ VRAM
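As a rough cross-check on these tiers, the memory needed for the weights alone can be approximated as parameters × bits per weight ÷ 8. The sketch below assumes that simplification and ignores the KV cache and runtime overhead, so real usage will be higher. `estimate_weights_gb` is a hypothetical helper name, not part of Ollama or any NVIDIA tool.

```shell
#!/bin/sh
# Rough estimate of the memory needed for model weights alone.
# Assumption: size_GB ~= params_in_billions * bits_per_weight / 8.
# KV cache and runtime overhead are NOT included.
estimate_weights_gb() {
    params_b=$1   # parameter count in billions, e.g. 7
    bits=$2       # bits per weight, e.g. 4 for 4-bit quantization, 16 for FP16
    LC_ALL=C awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Example: a 7B model at 4-bit vs. 16-bit precision
# estimate_weights_gb 7 4
# estimate_weights_gb 7 16
```

By this estimate a 7B model needs about 3.5 GB for weights at 4-bit and about 14 GB at 16-bit, which suggests the tiers above assume quantized models.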

Step 1: Configure GPU Passthrough in Proxmox

Follow this video tutorial to set up GPU passthrough from your Proxmox host to your Ubuntu VM:

📹 Proxmox GPU Passthrough Guide

The video covers:

Enabling IOMMU in BIOS
Configuring Proxmox for PCIe passthrough
Adding the GPU to your VM
Verifying the setup
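Once IOMMU is enabled, it can help to see how devices are grouped on the Proxmox host, since passthrough tends to work best when the GPU and its audio function sit in a group of their own. The following is a minimal sketch that reads the standard sysfs layout; `list_iommu_groups` is a hypothetical helper name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# List PCI devices per IOMMU group, one "group <n> <pci-address>" line each.
# The optional argument overrides the sysfs root (useful for testing).
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # no groups found: IOMMU is likely disabled
        group=${dev#"$root"/}              # strip the root prefix
        group=${group%%/*}                 # keep only the group number
        echo "group $group ${dev##*/}"
    done
}

# Usage on the Proxmox host:
# list_iommu_groups
```

If the function prints nothing, IOMMU is probably not active and the BIOS and kernel settings from the video are worth rechecking.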

After completing the setup, verify that the GPU is visible in your VM:

lspci | grep -i nvidia

Expected output:

01:00.0 VGA compatible controller: NVIDIA Corporation ...
01:00.1 Audio device: NVIDIA Corporation ...

If you see NVIDIA devices listed, passthrough is working correctly.
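For provisioning scripts, the same check can be turned into a pass/fail test. This is a small sketch, not an official tool: `has_nvidia_gpu` is a hypothetical helper that reads `lspci` output on stdin and looks for an NVIDIA device in either of the two display classes `lspci` commonly reports.

```shell
#!/bin/sh
# Succeeds if lspci-style text on stdin lists an NVIDIA display device.
# Consumer cards usually appear as "VGA compatible controller",
# compute-only cards as "3D controller".
has_nvidia_gpu() {
    grep -qiE '(VGA compatible controller|3D controller): NVIDIA'
}

# Usage inside the VM:
# lspci | has_nvidia_gpu && echo "GPU visible" || echo "GPU missing"
```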

Step 2: Install NVIDIA Drivers

The NVIDIA drivers enable your Ubuntu system to communicate with the GPU hardware. Follow this guide for driver installation on Ubuntu:

📖 NVIDIA Driver Installation Guide

Quick verification after installation:

nvidia-smi

Expected output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8    15W / 250W |      0MiB / 16384MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This confirms your GPU is detected and the driver is working.
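The table above is meant for humans; `nvidia-smi` also has a query mode that is easier to use in scripts. The sketch below assumes you want to confirm the card meets a minimum VRAM figure from the prerequisites. `gpu_has_vram` is a hypothetical helper that reads the total memory in MiB on stdin.

```shell
#!/bin/sh
# Succeeds if the first GPU's total memory (MiB, read from stdin) is at
# least the given minimum. Expected input is the output of:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
gpu_has_vram() {
    min_mib=$1
    LC_ALL=C awk -v min="$min_mib" '
        NR == 1 { ok = ($1 + 0 >= min + 0) }   # only the first GPU is checked
        END     { exit ok ? 0 : 1 }            # empty input counts as failure
    '
}

# Usage:
# nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | gpu_has_vram 8000
```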

Step 3: Install CUDA Toolkit

CUDA is NVIDIA’s parallel computing platform, which GPU-accelerated applications build on. Download and install CUDA from the official source:

📦 CUDA Toolkit Downloads

Select your operating system, architecture, and distribution to get the appropriate installation commands.

Step 4: NVIDIA Container Toolkit

Follow the official installation guide to install the NVIDIA Container Toolkit and enable GPU access in Docker containers:

📖 NVIDIA Container Toolkit Installation

Verify GPU access in Docker:

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

You should see the same nvidia-smi output as before, confirming Docker can access the GPU.
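If the test container fails to start, one thing worth checking is whether Docker knows about the NVIDIA runtime. NVIDIA's guide registers it with `sudo nvidia-ctk runtime configure --runtime=docker` followed by a Docker restart. The sketch below only inspects `docker info` output; `docker_has_nvidia_runtime` is a hypothetical helper, and a missing runtime entry is one possible cause among several, not a definitive diagnosis.

```shell
#!/bin/sh
# Succeeds if `docker info` output on stdin lists "nvidia" among the runtimes.
docker_has_nvidia_runtime() {
    grep -qE '^[[:space:]]*Runtimes:.*[[:space:]]nvidia([[:space:]]|$)'
}

# Usage:
# docker info 2>/dev/null | docker_has_nvidia_runtime || echo "nvidia runtime not registered"
#
# Registration step from NVIDIA's installation guide:
# sudo nvidia-ctk runtime configure --runtime=docker
# sudo systemctl restart docker
```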

Step 5: Deploy Ollama and Open WebUI

Create a directory for your setup:

mkdir -p ~/llm-server
cd ~/llm-server

Create docker-compose.yml:

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: always
    environment:
      OLLAMA_KEEP_ALIVE: -1
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: always
    environment:
      ENABLE_ADMIN_CHAT_ACCESS: "false"
    ports:
      - "80:8080"
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  ollama:
  open-webui:

What these services do:

Ollama Service

OLLAMA_KEEP_ALIVE: -1: Keeps models loaded in GPU memory indefinitely for instant responses
Port 11434: API endpoint for model inference
Volume: Persists downloaded models between restarts
GPU reservation: Ensures the container can access all available GPUs

Open WebUI Service

Port 80: Web interface accessible at http://your-vm-ip
ENABLE_ADMIN_CHAT_ACCESS: false: Prevents the admin user from reading other users’ chats (checking your employees’ chats would be kind of creepy)
host.docker.internal: Allows the web UI to communicate with Ollama
Volume: Stores user data, conversations, and settings

Start the services:

docker compose up -d

Verify both containers are running:

docker ps

Expected output:

CONTAINER ID   IMAGE                              STATUS         PORTS
abc123def456   ollama/ollama                      Up 2 minutes   0.0.0.0:11434->11434/tcp
def456abc789   ghcr.io/open-webui/open-webui:main Up 2 minutes   0.0.0.0:80->8080/tcp
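Beyond `docker ps`, the Ollama API itself can be probed through the published port. A plain `curl http://localhost:11434/` should answer with a short status message, and `/api/tags` returns the installed models as JSON. The sketch below pulls the model names out of that JSON with `grep` and `sed` so no extra JSON tool is needed; `list_model_names` is a hypothetical helper, and it assumes model names contain no escaped quotes.

```shell
#!/bin/sh
# Extract model names from the JSON returned by Ollama's /api/tags endpoint.
# Reads the JSON on stdin and prints one model name per line.
list_model_names() {
    grep -oE '"name":[[:space:]]*"[^"]*"' \
        | sed -E 's/^"name":[[:space:]]*"(.*)"$/\1/'
}

# Usage on the VM (assumes the 11434 port mapping from docker-compose.yml):
# curl -s http://localhost:11434/api/tags | list_model_names
```

Right after the first start the list is empty, because no models have been downloaded yet.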

Step 6: Access Open WebUI and Download Models

Open your web browser and navigate to:

http://your-vm-ip

You’ll see the Open WebUI interface. On first access, you’ll need to create an admin account.

Download your first model:

Click on your profile icon in the top right
Go to Admin Panel → Settings → Models
In the “Pull a model from Ollama.com” field, enter a model name
Click the download button

Or download via command line:

docker exec -it ollama ollama pull llama3.2:7b

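The model will appear in Open WebUI’s model selector once downloaded. If the pull fails because the tag does not exist, look up the model in the Ollama library and use one of the tags listed there in this and the later steps (Llama 3.2, for example, is listed in 1b and 3b sizes at the time of writing).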
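A downloaded model can also be exercised without the web UI by posting to Ollama's `/api/generate` endpoint. The sketch below only builds the request body; `generate_payload` is a hypothetical helper, and it assumes the model tag and prompt contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Build a non-streaming request body for Ollama's /api/generate endpoint.
# $1 = model tag, $2 = prompt (neither may contain quotes or backslashes).
generate_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# Usage (replace MODEL_TAG with the tag you pulled):
# curl -s http://localhost:11434/api/generate -d "$(generate_payload MODEL_TAG 'Say hello')"
```

Setting `stream` to false makes the server return a single JSON object instead of a stream of partial responses, which is easier to read in a terminal.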

Step 7: Verify GPU Acceleration

Send a prompt in Open WebUI so the model is loaded, then check that it is running on the GPU:

docker exec -it ollama nvidia-smi

Expected output showing GPU memory usage:

+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 45%   65C    P2   180W / 250W |   7234MiB / 16384MiB |     95%      Default |
+-----------------------------------------------------------------------------+

Check loaded models:

docker exec -it ollama ollama ps

Expected output:

NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
llama3.2:7b     a80c4f17acd5    4.7 GB   100% GPU     4096       Forever

Critical indicators:

PROCESSOR: 100% GPU ✅ - Model is running on GPU (good!)
PROCESSOR: 100% CPU ❌ - Model fell back to CPU
UNTIL: Forever ✅ - Model stays loaded (due to OLLAMA_KEEP_ALIVE: -1)
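The PROCESSOR check can be automated, for example in a health check. This is a minimal sketch that treats anything other than "100% GPU" as a failure; `all_models_on_gpu` is a hypothetical helper that reads `ollama ps` output on stdin and relies on the column layout shown above.

```shell
#!/bin/sh
# Read `ollama ps` output on stdin; succeed only if at least one model is
# listed and every listed model shows "100% GPU" in the PROCESSOR column.
all_models_on_gpu() {
    awk '
        NR > 1 && NF { seen = 1; if ($0 !~ /100% GPU/) bad = 1 }  # skip header and blank lines
        END          { exit (seen && !bad) ? 0 : 1 }
    '
}

# Usage:
# docker exec ollama ollama ps | all_models_on_gpu || echo "a model is not fully on the GPU"
```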

Step 8: Customize Model Context Length

The context window determines how much text the model can remember in a conversation. Larger contexts allow for longer discussions but use more VRAM.

Access Ollama’s interactive mode:

docker exec -it ollama ollama run llama3.2:7b

Set a custom context length:

>>> /set parameter num_ctx 10000
Set parameter 'num_ctx' to '10000'

⚠️ Important: These changes are temporary and lost when the model unloads!

Make context changes permanent:

>>> /set parameter num_ctx 10000
Set parameter 'num_ctx' to '10000'
>>> /save llama3.2:7b-10k
Created new model 'llama3.2:7b-10k'
>>> /bye

What this does:

Sets context to 10,000 tokens
Saves as a new model variant with the custom context
The new model persists these settings permanently
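A non-interactive alternative, useful when the setup has to be reproducible, is to describe the variant in a Modelfile and build it with `ollama create`. The sketch below assumes the same base model and context length as above; `write_ctx_modelfile` and the `/tmp/Modelfile` path are hypothetical choices for illustration.

```shell
#!/bin/sh
# Write a minimal Modelfile that derives a variant of a base model with a
# fixed context length. FROM and PARAMETER are standard Modelfile instructions.
# $1 = base model tag, $2 = context length in tokens, $3 = output file
write_ctx_modelfile() {
    base=$1; ctx=$2; out=$3
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$ctx" > "$out"
}

# Usage (the file has to be readable inside the container, so copy it in first):
# write_ctx_modelfile llama3.2:7b 10000 Modelfile
# docker cp Modelfile ollama:/tmp/Modelfile
# docker exec ollama ollama create llama3.2:7b-10k -f /tmp/Modelfile
```

Keeping the Modelfile in version control alongside docker-compose.yml makes it easy to recreate the variant on a fresh VM.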

⚠️ VRAM Warning: Setting context too high can exhaust your GPU memory, causing the model to fall back to CPU (much slower). Always monitor VRAM usage with nvidia-smi after changing context length.

Verify your custom model:

docker exec -it ollama ollama ps

Expected output:

NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
llama3.2:7b-10k    a80c4f17acd5    4.7 GB   100% GPU     10000      Forever
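To keep an eye on VRAM after raising the context length, the used and total memory can be reduced to a single percentage. This is a sketch built on `nvidia-smi` query mode; `vram_used_pct` is a hypothetical helper that reads "used, total" pairs in MiB on stdin, one GPU per line.

```shell
#!/bin/sh
# Convert "used, total" MiB pairs (one GPU per line) into a usage percentage,
# truncated to a whole number. Input format matches:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
vram_used_pct() {
    LC_ALL=C awk -F',' '$2 + 0 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Usage:
# docker exec ollama nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | vram_used_pct
```

A value close to 100 after loading the custom model is a hint that the next context increase may push part of the model onto the CPU.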

Your custom model will now appear in Open WebUI’s model selector.

📖 If you are having problems, check out the Official Ollama Troubleshooting Guide
